Introduction

  • This notebook explores in detail what can be done with the data provided about all players.
  • It includes an in-depth analysis of some of the main data features.
  • Interesting ideas:
    • Come up with a dream team (the players who are best in each position)
    • Top 3 clubs that are good in 'Attacking' and their top 3 contributors
    • Top 3 clubs that are good in 'Defense' and their top 3 contributors
    • Best club overall and its top 3 contributors
    • Top 3 nations that have the best footballers
    • Next best player according to wages
  • How wage and players are related
  • Players who aren't performing after a certain age
  • Come up with an ID card that shows a person's skills, team, name, weight, height and income
  • Group subset skills under common skills
  • Which position should a person train for if they want to make it quickly?

Information about the data given

  • Columns
  • Unnamed: 0 (first column): row number
  • ID: unique id for every player
  • Name: name
  • Age: age
  • Photo: url to the player's photo
  • Nationality: nationality
  • Flag: url to the player's country flag
  • Overall: overall rating
  • Potential: potential rating
  • Club: current club
  • Club Logo: url to club logo
  • Value: current market value
  • Wage: current wage
  • Special
  • Preferred Foot: left/right
  • International Reputation: rating on scale of 5
  • Weak Foot: rating on scale of 5
  • Skill Moves: rating on scale of 5
  • Work Rate: attack work rate/defence work rate
  • Body Type: body type of player
  • Real Face
  • Position: position on the pitch
  • Jersey Number: jersey number
  • Joined: joined date
  • Loaned From: club name if applicable
  • Contract Valid Until: contract end date
  • Height: height of the player
  • Weight: weight of the player
  • LS: rating on scale of 100
  • ST: rating on scale of 100
  • RS: rating on scale of 100
  • LW: rating on scale of 100
  • LF: rating on scale of 100
  • CF: rating on scale of 100
  • RF: rating on scale of 100
  • RW: rating on scale of 100
  • LAM: rating on scale of 100
  • CAM: rating on scale of 100
  • RAM: rating on scale of 100
  • LM: rating on scale of 100
  • LCM: rating on scale of 100
  • CM: rating on scale of 100
  • RCM: rating on scale of 100
  • RM: rating on scale of 100
  • LWB: rating on scale of 100
  • LDM: rating on scale of 100
  • CDM: rating on scale of 100
  • RDM: rating on scale of 100
  • RWB: rating on scale of 100
  • LB: rating on scale of 100
  • LCB: rating on scale of 100
  • CB: rating on scale of 100
  • RCB: rating on scale of 100
  • RB: rating on scale of 100
  • Crossing: rating on scale of 100
  • Finishing: rating on scale of 100
  • HeadingAccuracy: rating on scale of 100
  • ShortPassing: rating on scale of 100
  • Volleys: rating on scale of 100
  • Dribbling: rating on scale of 100
  • Curve: rating on scale of 100
  • FKAccuracy: rating on scale of 100
  • LongPassing: rating on scale of 100
  • BallControl: rating on scale of 100
  • Acceleration: rating on scale of 100
  • SprintSpeed: rating on scale of 100
  • Agility: rating on scale of 100
  • Reactions: rating on scale of 100
  • Balance: rating on scale of 100
  • ShotPower: rating on scale of 100
  • Jumping: rating on scale of 100
  • Stamina: rating on scale of 100
  • Strength: rating on scale of 100
  • LongShots: rating on scale of 100
  • Aggression: rating on scale of 100
  • Interceptions: rating on scale of 100
  • Positioning: rating on scale of 100
  • Vision: rating on scale of 100
  • Penalties: rating on scale of 100
  • Composure: rating on scale of 100
  • Marking: rating on scale of 100
  • StandingTackle: rating on scale of 100
  • SlidingTackle: rating on scale of 100
  • GKDiving: rating on scale of 100
  • GKHandling: rating on scale of 100
  • GKKicking: rating on scale of 100
  • GKPositioning: rating on scale of 100
  • GKReflexes: rating on scale of 100
  • Release Clause: release clause value

Importing the packages needed

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import tensorflow as tf
from math import pi

Load and prepare the data

In [2]:
data_path = "data.csv"
df = pd.read_csv(data_path)

Initial Data Inspection

In [3]:
df.columns
Out[3]:
Index(['Unnamed: 0', 'ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag',
       'Overall', 'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Jersey Number', 'Joined', 'Loaned From', 'Contract Valid Until',
       'Height', 'Weight', 'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW',
       'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM',
       'CDM', 'RDM', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB', 'Crossing',
       'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling',
       'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
       'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
       'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
       'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
       'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
       'GKKicking', 'GKPositioning', 'GKReflexes', 'Release Clause'],
      dtype='object')
In [4]:
df.head()
Out[4]:
Unnamed: 0 ID Name Age Photo Nationality Flag Overall Potential Club ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 0 158023 L. Messi 31 https://cdn.sofifa.org/players/4/19/158023.png Argentina https://cdn.sofifa.org/flags/52.png 94 94 FC Barcelona ... 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 €226.5M
1 1 20801 Cristiano Ronaldo 33 https://cdn.sofifa.org/players/4/19/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Juventus ... 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 €127.1M
2 2 190871 Neymar Jr 26 https://cdn.sofifa.org/players/4/19/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 93 Paris Saint-Germain ... 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0 €228.1M
3 3 193080 De Gea 27 https://cdn.sofifa.org/players/4/19/193080.png Spain https://cdn.sofifa.org/flags/45.png 91 93 Manchester United ... 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0 €138.6M
4 4 192985 K. De Bruyne 27 https://cdn.sofifa.org/players/4/19/192985.png Belgium https://cdn.sofifa.org/flags/7.png 91 92 Manchester City ... 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0 €196.4M

5 rows × 89 columns

In [5]:
df.describe()
Out[5]:
Unnamed: 0 ID Age Overall Potential Special International Reputation Weak Foot Skill Moves Jersey Number ... Penalties Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes
count 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18159.000000 18159.000000 18159.000000 18147.000000 ... 18159.000000 18159.000000 18159.000000 18159.000000 18159.000000 18159.000000 18159.000000 18159.000000 18159.000000 18159.000000
mean 9103.000000 214298.338606 25.122206 66.238699 71.307299 1597.809908 1.113222 2.947299 2.361308 19.546096 ... 48.548598 58.648274 47.281623 47.697836 45.661435 16.616223 16.391596 16.232061 16.388898 16.710887
std 5256.052511 29965.244204 4.669943 6.908930 6.136496 272.586016 0.394031 0.660456 0.756164 15.947765 ... 15.704053 11.436133 19.904397 21.664004 21.289135 17.695349 16.906900 16.502864 17.034669 17.955119
min 0.000000 16.000000 16.000000 46.000000 48.000000 731.000000 1.000000 1.000000 1.000000 1.000000 ... 5.000000 3.000000 3.000000 2.000000 3.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 4551.500000 200315.500000 21.000000 62.000000 67.000000 1457.000000 1.000000 3.000000 2.000000 8.000000 ... 39.000000 51.000000 30.000000 27.000000 24.000000 8.000000 8.000000 8.000000 8.000000 8.000000
50% 9103.000000 221759.000000 25.000000 66.000000 71.000000 1635.000000 1.000000 3.000000 2.000000 17.000000 ... 49.000000 60.000000 53.000000 55.000000 52.000000 11.000000 11.000000 11.000000 11.000000 11.000000
75% 13654.500000 236529.500000 28.000000 71.000000 75.000000 1787.000000 1.000000 3.000000 3.000000 26.000000 ... 60.000000 67.000000 64.000000 66.000000 64.000000 14.000000 14.000000 14.000000 14.000000 14.000000
max 18206.000000 246620.000000 45.000000 94.000000 95.000000 2346.000000 5.000000 5.000000 5.000000 99.000000 ... 92.000000 96.000000 94.000000 93.000000 91.000000 90.000000 92.000000 91.000000 90.000000 94.000000

8 rows × 44 columns

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
Unnamed: 0                  18207 non-null int64
ID                          18207 non-null int64
Name                        18207 non-null object
Age                         18207 non-null int64
Photo                       18207 non-null object
Nationality                 18207 non-null object
Flag                        18207 non-null object
Overall                     18207 non-null int64
Potential                   18207 non-null int64
Club                        17966 non-null object
Club Logo                   18207 non-null object
Value                       18207 non-null object
Wage                        18207 non-null object
Special                     18207 non-null int64
Preferred Foot              18159 non-null object
International Reputation    18159 non-null float64
Weak Foot                   18159 non-null float64
Skill Moves                 18159 non-null float64
Work Rate                   18159 non-null object
Body Type                   18159 non-null object
Real Face                   18159 non-null object
Position                    18147 non-null object
Jersey Number               18147 non-null float64
Joined                      16654 non-null object
Loaned From                 1264 non-null object
Contract Valid Until        17918 non-null object
Height                      18159 non-null object
Weight                      18159 non-null object
LS                          16122 non-null object
ST                          16122 non-null object
RS                          16122 non-null object
LW                          16122 non-null object
LF                          16122 non-null object
CF                          16122 non-null object
RF                          16122 non-null object
RW                          16122 non-null object
LAM                         16122 non-null object
CAM                         16122 non-null object
RAM                         16122 non-null object
LM                          16122 non-null object
LCM                         16122 non-null object
CM                          16122 non-null object
RCM                         16122 non-null object
RM                          16122 non-null object
LWB                         16122 non-null object
LDM                         16122 non-null object
CDM                         16122 non-null object
RDM                         16122 non-null object
RWB                         16122 non-null object
LB                          16122 non-null object
LCB                         16122 non-null object
CB                          16122 non-null object
RCB                         16122 non-null object
RB                          16122 non-null object
Crossing                    18159 non-null float64
Finishing                   18159 non-null float64
HeadingAccuracy             18159 non-null float64
ShortPassing                18159 non-null float64
Volleys                     18159 non-null float64
Dribbling                   18159 non-null float64
Curve                       18159 non-null float64
FKAccuracy                  18159 non-null float64
LongPassing                 18159 non-null float64
BallControl                 18159 non-null float64
Acceleration                18159 non-null float64
SprintSpeed                 18159 non-null float64
Agility                     18159 non-null float64
Reactions                   18159 non-null float64
Balance                     18159 non-null float64
ShotPower                   18159 non-null float64
Jumping                     18159 non-null float64
Stamina                     18159 non-null float64
Strength                    18159 non-null float64
LongShots                   18159 non-null float64
Aggression                  18159 non-null float64
Interceptions               18159 non-null float64
Positioning                 18159 non-null float64
Vision                      18159 non-null float64
Penalties                   18159 non-null float64
Composure                   18159 non-null float64
Marking                     18159 non-null float64
StandingTackle              18159 non-null float64
SlidingTackle               18159 non-null float64
GKDiving                    18159 non-null float64
GKHandling                  18159 non-null float64
GKKicking                   18159 non-null float64
GKPositioning               18159 non-null float64
GKReflexes                  18159 non-null float64
Release Clause              16643 non-null object
dtypes: float64(38), int64(6), object(45)
memory usage: 12.4+ MB
In [7]:
df.isna().sum()
Out[7]:
Unnamed: 0                      0
ID                              0
Name                            0
Age                             0
Photo                           0
Nationality                     0
Flag                            0
Overall                         0
Potential                       0
Club                          241
Club Logo                       0
Value                           0
Wage                            0
Special                         0
Preferred Foot                 48
International Reputation       48
Weak Foot                      48
Skill Moves                    48
Work Rate                      48
Body Type                      48
Real Face                      48
Position                       60
Jersey Number                  60
Joined                       1553
Loaned From                 16943
Contract Valid Until          289
Height                         48
Weight                         48
LS                           2085
ST                           2085
                            ...  
Dribbling                      48
Curve                          48
FKAccuracy                     48
LongPassing                    48
BallControl                    48
Acceleration                   48
SprintSpeed                    48
Agility                        48
Reactions                      48
Balance                        48
ShotPower                      48
Jumping                        48
Stamina                        48
Strength                       48
LongShots                      48
Aggression                     48
Interceptions                  48
Positioning                    48
Vision                         48
Penalties                      48
Composure                      48
Marking                        48
StandingTackle                 48
SlidingTackle                  48
GKDiving                       48
GKHandling                     48
GKKicking                      48
GKPositioning                  48
GKReflexes                     48
Release Clause               1564
Length: 89, dtype: int64

Inference

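The data has many null values. 48 rows are missing most player attributes, 'Loaned From' is empty for 16,943 of the 18,207 rows, and each of the 26 position-rating columns (LS to RB) is missing 2,085 values.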
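To rank the columns by how much is missing, the null counts can be turned into percentages. This is a minimal sketch: the helper name `null_report` and the frame `toy` are assumptions made for illustration, not part of the notebook. On the real data it would be called on `df`.

```python
import pandas as pd

def null_report(frame):
    """Return the count and percentage of nulls per column, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        'n_missing': counts,
        'pct_missing': (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually contain nulls, sorted by severity.
    return report[report['n_missing'] > 0].sort_values('n_missing', ascending=False)

# Hypothetical toy data, only to show the shape of the output.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Club': ['X', None, 'Y', 'Z'],
    'Loaned From': [None, None, None, 'X'],
})
print(null_report(toy))
```

A report like this makes it easier to decide column by column whether to impute, drop, or leave the nulls.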

Data Cleaning

Strip the 'lbs' suffix from Weight

In [8]:
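def weight_correction(weight):
    # Weights are strings ending in 'lbs'; drop the last three characters.
    try:
        value = float(weight[:-3])
    except (TypeError, ValueError):
        # Missing (NaN) or malformed entries become 0 and are turned into NaN below.
        value = 0
    return value
df['Weight'] = df.Weight.apply(weight_correction)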
In [9]:
df.Weight = pd.to_numeric(df.Weight)
In [10]:
df.Weight = df.Weight.replace(0, np.nan)
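The same cleaning can also be done without a row-wise function. The sketch below assumes the raw strings look like '159lbs' (the sample values are made up) and uses `pd.to_numeric` with `errors='coerce'`, so unparseable or missing entries become NaN directly.

```python
import pandas as pd

# Hypothetical sample in the assumed format of the Weight column.
weights = pd.Series(['159lbs', '183lbs', None, '168lbs'])

# Drop the 'lbs' suffix, then let pandas convert; anything that cannot be
# parsed (including missing entries) becomes NaN, with no 0 placeholder.
clean = pd.to_numeric(weights.str.replace('lbs', '', regex=False), errors='coerce')
print(clean)
```

This avoids the intermediate 0 value, which would otherwise have to be replaced with NaN in a separate step.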

Change value and wage to a real number

In [11]:
def value_to_int(df_value):
    try:
        value = float(df_value[1:-1])
        suffix = df_value[-1:]
        if suffix == 'M':
            value = value * 1000000
        elif suffix == 'K':
            value = value * 1000
    except ValueError:
        value = 0
    return value

df['Value'] = df['Value'].apply(value_to_int)
df['Wage'] = df['Wage'].apply(value_to_int)

df.Value = df.Value.replace(0, np.nan)
df.Wage = df.Wage.replace(0, np.nan)
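To sanity-check the conversion on a few strings, here is a small standalone variant of the parser. The name `parse_euro` is an assumption of this sketch. '€226.5M' appears in the `df.head()` output above, while '€565K' and '€0' are illustrative inputs.

```python
def parse_euro(text):
    """Convert strings such as '€226.5M' or '€565K' to a float amount in euros."""
    multipliers = {'M': 1_000_000, 'K': 1_000}
    body = text[1:]                      # drop the leading currency symbol
    if body and body[-1] in multipliers:
        return float(body[:-1]) * multipliers[body[-1]]
    return float(body)                   # plain number with no suffix, e.g. '€0'

for sample in ['€226.5M', '€565K', '€0']:
    print(sample, '->', parse_euro(sample))
```

Unlike `value_to_int`, this variant parses a suffix-free amount directly instead of relying on a `ValueError`.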

Fill the Nan Values

  • We will try to fill everything with some logic behind it

Fixing Weight

In [12]:
df.Weight.isna().sum()
Out[12]:
48
In [13]:
df.Weight.mean()
Out[13]:
165.97912880665234

According to livestrong data,

  • The normal weight range for a player 5 feet 9 inches tall is 136 to 169 pounds.
  • Since the mean player weight is about 166 pounds, which is consistent with that range, we can fill the null values with the mean weight
In [14]:
df['Weight'] = df['Weight'].fillna(df['Weight'].mean())

Fixing Height

In [15]:
df.Height.isna().sum()
Out[15]:
48
In [16]:
plt.figure(figsize = (20, 10))
sns.countplot(x='Height', data=df)
plt.show()

According to livestrong data,

  • The height of a player varies with the position he plays
  • The average height of a player is 5 feet 11 1/2 inches
  • The data also shows that most players are between 5'9" and 6'1"
  • So we fill the missing heights with 5'11"
In [17]:
df['Height'] = df['Height'].fillna("5'11")

Fixing Weak Foot Rating

In [18]:
wf_missing = df['Weak Foot'].isna()
wf_missing.sum()
Out[18]:
48
In [19]:
weak_foot_prob = df['Weak Foot'].value_counts(normalize=True)
weak_foot_prob
Out[19]:
3.0    0.624979
2.0    0.207115
4.0    0.146594
5.0    0.012611
1.0    0.008701
Name: Weak Foot, dtype: float64
  • This shows that the majority of players have a weak foot rating of 3
  • We fill the missing Weak Foot ratings by sampling from this same probability distribution
In [20]:
df.loc[wf_missing,'Weak Foot'] = np.random.choice(weak_foot_prob.index, size=wf_missing.sum(),p=weak_foot_prob.values)
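The same sampling pattern is reused for several columns below, and the `np.random.choice` calls in this notebook are unseeded, so each run fills different values. A minimal sketch of a reusable, seeded version follows. The helper name `fill_from_distribution`, the seed, and the sample ratings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fill_from_distribution(series, rng):
    """Fill NaNs by sampling observed values in proportion to their frequency."""
    probs = series.value_counts(normalize=True)   # NaNs are excluded by default
    missing = series.isna()
    filled = series.copy()
    filled[missing] = rng.choice(probs.index.to_numpy(), size=missing.sum(),
                                 p=probs.to_numpy())
    return filled

rng = np.random.default_rng(0)   # fixed seed (an arbitrary choice) for repeatable fills
ratings = pd.Series([3.0, 3.0, 2.0, np.nan, 4.0, np.nan])   # hypothetical sample
print(fill_from_distribution(ratings, rng))
```

Sampling keeps the shape of the observed distribution, whereas filling with the most common value would inflate it.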

Fixing Preferred Foot

In [21]:
pf_missing = df['Preferred Foot'].isna()
pf_missing.sum()
Out[21]:
48
In [22]:
df['Preferred Foot'].value_counts()
Out[22]:
Right    13948
Left      4211
Name: Preferred Foot, dtype: int64
In [23]:
foot_distribution = df['Preferred Foot'].value_counts(normalize=True)
foot_distribution
Out[23]:
Right    0.768104
Left     0.231896
Name: Preferred Foot, dtype: float64
  • From the data, it's clear that about 77% of players are right-footed
  • So we fill the missing preferred foot values using the same probability distribution
In [24]:
df.loc[pf_missing, 'Preferred Foot'] = np.random.choice(foot_distribution.index, size = pf_missing.sum(), p=foot_distribution.values)
In [25]:
df['Preferred Foot'].value_counts()
Out[25]:
Right    13984
Left      4223
Name: Preferred Foot, dtype: int64

Filling Position

In [26]:
fp_missing = df.Position.isna()
fp_missing.sum()
Out[26]:
60
In [27]:
position_prob = df.Position.value_counts(normalize=True)
position_prob 
Out[27]:
ST     0.118587
GK     0.111589
CB     0.097978
CM     0.076817
LB     0.072850
RB     0.071141
RM     0.061939
LM     0.060341
CAM    0.052791
CDM    0.052240
RCB    0.036480
LCB    0.035708
LCM    0.021767
RCM    0.021546
LW     0.020995
RW     0.020389
RDM    0.013666
LDM    0.013391
LS     0.011407
RS     0.011186
RWB    0.004794
LWB    0.004298
CF     0.004078
RAM    0.001157
LAM    0.001157
RF     0.000882
LF     0.000827
Name: Position, dtype: float64
In [28]:
plt.figure(figsize = (20, 10))
sns.countplot(x=df.Position, data=df)
plt.show()
  • From the data, it's clear that the share of players differs widely between positions
  • So we fill the missing positions using the same probability distribution
In [29]:
df.loc[fp_missing, 'Position'] = np.random.choice(position_prob.index, p=position_prob.values, size=fp_missing.sum())

Filling Skill Moves

In [30]:
fs_missing = df['Skill Moves'].isna()
fs_missing.sum()
Out[30]:
48
In [31]:
skill_moves_prob = df['Skill Moves'].value_counts(normalize=True)
skill_moves_prob
Out[31]:
2.0    0.471667
3.0    0.363456
1.0    0.111570
4.0    0.050498
5.0    0.002809
Name: Skill Moves, dtype: float64
  • We can fill the NaN values using the same probability distribution
In [32]:
df.loc[fs_missing, 'Skill Moves'] = np.random.choice(skill_moves_prob.index, p=skill_moves_prob.values, size=fs_missing.sum())

Filling Body Type

In [33]:
bt_missing = df['Body Type'].isna()
bt_missing.sum()
Out[33]:
48
In [34]:
bt_prob = df['Body Type'].value_counts(normalize=True)
bt_prob
Out[34]:
Normal                 0.583457
Lean                   0.353378
Stocky                 0.062779
Shaqiri                0.000055
C. Ronaldo             0.000055
PLAYER_BODY_TYPE_25    0.000055
Neymar                 0.000055
Akinfenwa              0.000055
Messi                  0.000055
Courtois               0.000055
Name: Body Type, dtype: float64
  • It is unclear why 'Neymar', 'Messi', 'C. Ronaldo', 'Shaqiri', 'Akinfenwa' and 'Courtois' are listed as body types, since they are names of football players
  • We fill the missing body types using only 'Normal' and 'Lean', in their relative proportions
In [35]:
nl_prob = bt_prob[['Normal', 'Lean']] / bt_prob[['Normal', 'Lean']].sum()  # rescale so the two shares sum to 1
df.loc[bt_missing, 'Body Type'] = np.random.choice(nl_prob.index, p=nl_prob.values, size=bt_missing.sum())

Filling Wages

In [36]:
wage_missing = df.Wage.isna()
wage_missing.sum()
Out[36]:
241
In [37]:
wage_prob = df.Wage.value_counts(normalize=True)
wage_prob
Out[37]:
1000.0      0.272737
2000.0      0.157353
3000.0      0.103362
4000.0      0.069854
5000.0      0.048369
6000.0      0.037961
7000.0      0.027162
8000.0      0.023544
9000.0      0.018257
10000.0     0.017756
11000.0     0.016253
12000.0     0.014249
13000.0     0.012635
15000.0     0.011132
14000.0     0.010464
17000.0     0.008961
18000.0     0.008460
16000.0     0.007570
20000.0     0.007459
19000.0     0.007236
22000.0     0.007125
21000.0     0.006178
24000.0     0.005677
26000.0     0.005288
25000.0     0.005121
23000.0     0.005009
27000.0     0.003451
29000.0     0.003395
31000.0     0.003340
30000.0     0.003340
              ...   
170000.0    0.000167
150000.0    0.000167
87000.0     0.000167
90000.0     0.000167
215000.0    0.000167
355000.0    0.000167
315000.0    0.000167
97000.0     0.000111
83000.0     0.000111
185000.0    0.000111
260000.0    0.000111
340000.0    0.000111
210000.0    0.000111
200000.0    0.000056
235000.0    0.000056
225000.0    0.000056
245000.0    0.000056
250000.0    0.000056
190000.0    0.000056
230000.0    0.000056
565000.0    0.000056
455000.0    0.000056
380000.0    0.000056
255000.0    0.000056
290000.0    0.000056
405000.0    0.000056
93000.0     0.000056
300000.0    0.000056
265000.0    0.000056
420000.0    0.000056
Name: Wage, Length: 143, dtype: float64
  • Well-known players all have a recorded wage, so the missing wages most likely belong to lesser-known players
  • The wage distribution shows that most players earn very low wages
  • We should not fill with the mean, because a few very high wages pull the mean up
  • Filling with the mean would give lesser-known players higher wages than others with the same talent
  • So we fill the NaN wages by sampling from the observed wage distribution
In [38]:
df.loc[wage_missing, 'Wage'] = np.random.choice(wage_prob.index, p=wage_prob.values, size=wage_missing.sum())
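The argument against the mean can be illustrated with a small right-skewed sample. The wages below are made up for the example and only mimic the shape of the distribution above, with many low values and one very high one.

```python
import pandas as pd

# Hypothetical wages (in euros): many low earners and one star player.
wages = pd.Series([1000, 1000, 2000, 2000, 3000, 4000, 565000])

print('mean  :', wages.mean())
print('median:', wages.median())
# Share of players earning less than the mean.
print('below mean:', (wages < wages.mean()).mean())
```

Here six of the seven players earn less than the mean, so a mean fill would overstate the wage of a typical player. Sampling from the distribution, or using the median, avoids that.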

Filling the rest of the values

  • Since all features with the float64 datatype hold continuous values, we fill their NaN values with the mean
  • Randomly fill the remaining text features such as 'Contract Valid Until', 'Work Rate', 'Joined', 'Loaned From' and 'Club'
In [39]:
# Continuous (float64) features: fill NaN with the column mean.
# This already covers 'Jersey Number' and 'International Reputation'.
for feature in df.columns:
    if df[feature].dtype == 'float64':
        df[feature] = df[feature].fillna(df[feature].mean())

# Text features: pick one random observed value per column and use it for
# every missing entry (dropna() keeps NaN itself from being picked).
for feature in ['Contract Valid Until', 'Loaned From', 'Joined', 'Club', 'Work Rate']:
    df[feature] = df[feature].fillna(np.random.choice(df[feature].dropna()))

Fill the rest of the NaN data with 0

In [40]:
df.fillna(0, inplace = True)

Grouping similar skills together

  • Here, we group related skills together and generalize them into 8 categories
  • These 8 categories indicate which position a player is suited to
  • We do this so that we can analyze the players better and position them accordingly
In [41]:
# Each helper receives one player row and returns the mean of a skill group.
def defending(data):
    return data[['Marking', 'StandingTackle', 'SlidingTackle']].mean()

def general(data):
    return data[['HeadingAccuracy', 'Dribbling', 'Curve', 'BallControl']].mean()

def mental(data):
    return data[['Aggression', 'Interceptions', 'Positioning',
                 'Vision', 'Composure']].mean()

def passing(data):
    return data[['Crossing', 'ShortPassing', 'LongPassing']].mean()

def mobility(data):
    return data[['Acceleration', 'SprintSpeed', 'Agility', 'Reactions']].mean()

def power(data):
    return data[['Balance', 'Jumping', 'Stamina', 'Strength']].mean()

def rating(data):
    return data[['Potential', 'Overall']].mean()

def shooting(data):
    return data[['Finishing', 'Volleys', 'FKAccuracy',
                 'ShotPower', 'LongShots', 'Penalties']].mean()
In [42]:
# renaming a column
df.rename(columns={'Club Logo':'Club_Logo'}, inplace=True)

# adding these categories to the data

df['Defending'] = df.apply(defending, axis = 1)
df['General'] = df.apply(general, axis = 1)
df['Mental'] = df.apply(mental, axis = 1)
df['Passing'] = df.apply(passing, axis = 1)
df['Mobility'] = df.apply(mobility, axis = 1)
df['Power'] = df.apply(power, axis = 1)
df['Rating'] = df.apply(rating, axis = 1)
df['Shooting'] = df.apply(shooting, axis = 1)
In [43]:
players = df[['Name','Defending','General','Mental','Passing',
                'Mobility','Power','Rating','Shooting','Flag','Age',
                'Nationality', 'Photo', 'Club_Logo', 'Club']]
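The row-wise `df.apply(..., axis = 1)` calls above can be slow on roughly 18,000 rows. A column-wise mean gives the same group averages in one step per category. This is a sketch with a made-up two-player frame: the tackling values match the first and fifth rows of `df.head()`, and the passing values are hypothetical.

```python
import pandas as pd

# Two of the eight groups from above; the others follow the same pattern.
skill_groups = {
    'Defending': ['Marking', 'StandingTackle', 'SlidingTackle'],
    'Passing': ['Crossing', 'ShortPassing', 'LongPassing'],
}

# Hypothetical two-player frame with the required columns.
toy = pd.DataFrame({
    'Marking': [33.0, 68.0], 'StandingTackle': [28.0, 58.0], 'SlidingTackle': [26.0, 51.0],
    'Crossing': [84.0, 93.0], 'ShortPassing': [90.0, 92.0], 'LongPassing': [87.0, 91.0],
})

for group, columns in skill_groups.items():
    # mean(axis=1) averages across the listed columns for every row at once.
    toy[group] = toy[columns].mean(axis=1)

print(toy[['Defending', 'Passing']])
```

Keeping the groups in a dictionary also puts the category definitions in one place, which makes them easier to adjust.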

Skills/Position Analysis

Number of footballers available in each position

In [44]:
plt.figure(figsize = (20, 10))
ax = sns.countplot(x='Position', data=df, order = df['Position'].value_counts().index)
ax.set_title(label = 'Number of footballers available in each position', fontsize = 20)
plt.show()

Analyzing top 5 features on all skills

In [45]:
# A list (not a tuple) so that pandas treats it as multiple column names.
player_features = [
    'Acceleration', 'Aggression', 'Agility', 
    'Balance', 'BallControl', 'Composure', 
    'Crossing', 'Dribbling', 'FKAccuracy', 
    'Finishing', 'GKDiving', 'GKHandling', 
    'GKKicking', 'GKPositioning', 'GKReflexes', 
    'HeadingAccuracy', 'Interceptions', 'Jumping', 
    'LongPassing', 'LongShots', 'Marking', 'Penalties'
]

from math import pi
idx = 1
plt.figure(figsize=(15,45))
for position_name, features in df.groupby(df['Position'])[player_features].mean().iterrows():
    top_features = dict(features.nlargest(5))
    
    # number of variable
    categories=top_features.keys()
    N = len(categories)

    # We are going to plot the first line of the data frame.
    # But we need to repeat the first value to close the circular graph:
    values = list(top_features.values())
    values += values[:1]

    # What will be the angle of each axis in the plot? (we divide the plot / number of variable)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]

    # Initialise the spider plot
    ax = plt.subplot(10, 3, idx, polar=True)

    # Draw one axis per variable and add the labels
    plt.xticks(angles[:-1], categories, color='grey', size=8)
    
    # Draw ylabels
    ax.set_rlabel_position(0)
    plt.yticks([25,50,75], ["25","50","75"], color="grey", size=7)
    plt.ylim(0,100)
    
    plt.subplots_adjust(hspace = 0.5)
    
    # Plot data
    ax.plot(angles, values, linewidth=1, linestyle='solid')

    # Fill area
    ax.fill(angles, values, 'b', alpha=0.1)
    
    plt.title(position_name, size=11, y=1.1)
    
    idx += 1
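The step that closes the radar polygon is easy to get wrong, so here it is in isolation. The function name `closed_polygon` and the sample scores are assumptions made for illustration. The logic mirrors the `angles += angles[:1]` and `values += values[:1]` lines above.

```python
from math import pi

def closed_polygon(values):
    """Return (angles, values) with the first point repeated to close the loop."""
    n = len(values)
    angles = [i / n * 2 * pi for i in range(n)]   # evenly spaced spokes
    return angles + angles[:1], list(values) + list(values[:1])

angles, closed = closed_polygon([70, 65, 60, 58, 55])   # hypothetical top-5 scores
print(len(angles), len(closed))
```

Both lists end up one element longer than the number of features, because the line has to return to its starting spoke.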

Inference

  • The position graph and the skill graphs above show the most important skills for each position
  • This helps:
    • the management to look for the right set of skills in each player and to buy players with the skills they need
    • aspiring professionals to understand which skills they need to develop in order to get the position they want

More stats on skills and special moves

In [46]:
sns.set(style = 'dark', palette = 'colorblind', color_codes = True)
x = df.Special
plt.figure(figsize = (12, 8))
ax = sns.distplot(x, bins = 50, kde = False, color = 'm')
ax.set_xlabel(xlabel = 'Special score range', fontsize = 16)
ax.set_ylabel(ylabel = 'Count of the Players',fontsize = 16)
ax.set_title(label = 'Histogram for the Speciality Scores of the Players', fontsize = 20)
plt.show()
In [47]:
sns.scatterplot(x = 'Special', y='Wage', data=df)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2ef1f320>

Inference

  • The Special score and wage are only loosely positively related: a higher Special score does not reliably mean a higher wage
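The scatter plot can be backed by a number. The sketch below computes a linear (Pearson) and a rank-based (Spearman) correlation on made-up values. On the real data the same calls would use `df['Special']` and `df['Wage']`. Spearman is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical values, chosen only so that the two coefficients differ visibly.
toy = pd.DataFrame({
    'Special': [1200, 1500, 1700, 1900, 2100, 2300],
    'Wage':    [1000, 2000, 2000, 9000, 30000, 400000],
})

# Pearson measures linear association and is sensitive to extreme wages.
pearson = toy['Special'].corr(toy['Wage'])
# Spearman measures monotonic association: Pearson applied to the ranks.
spearman = toy['Special'].rank().corr(toy['Wage'].rank())
print(round(pearson, 2), round(spearman, 2))
```

When a few wages are far larger than the rest, the rank-based coefficient is usually the more informative of the two.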

Top 5 nations that have the best skilled footballers

In [48]:
plt.rcParams['figure.figsize'] = (20, 10)
skill_df = df[df['Skill Moves'] == 5][['Name','Nationality']]
sns.countplot(x='Nationality', data=skill_df, order=skill_df.Nationality.value_counts().iloc[:5].index)
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2f297b00>

Awards Section

  • This section lists the top 5 or top 3 contributors in several categories

Top footballer producing nations

In [49]:
df.Nationality.value_counts().nlargest(5).plot(kind='bar')
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2f2ce1d0>

Player weight distribution in top 5 footballer producing countries

In [50]:
countries = df.Nationality.value_counts().nlargest(5).index
In [51]:
data_countries = df[df['Nationality'].isin(countries)]
In [52]:
plt.rcParams['figure.figsize'] = (12, 7)

ax = sns.violinplot(x = data_countries['Nationality'], y = data_countries['Weight'], palette = 'colorblind')
ax.set_xlabel(xlabel = 'Countries', fontsize = 9)
ax.set_ylabel(ylabel = 'Weight in lbs', fontsize = 9)
ax.set_title(label = 'Distribution of Weight of players from different countries', fontsize = 20)
Out[52]:
Text(0.5, 1.0, 'Distribution of Weight of players from different countries')

Club-level Analysis

In [53]:
import matplotlib.image as mpimg
import requests
def print_club_flag(clubs):
    fig = plt.figure(figsize=(10,10))
    for index, club in enumerate(clubs):
        logo = df[df['Club'] == club]['Club_Logo'].iloc[0]
        logo_image = "img_club_logo.jpg"
        logo_flag = requests.get(logo).content
        with open(logo_image, 'wb') as handler:
            handler.write(logo_flag)
        img=mpimg.imread(logo_image)
        ax = fig.add_subplot(1, 6, index+1, xticks=[], yticks=[])
        fig.tight_layout()
        ax.imshow(img, interpolation="lanczos")
        ax.set_title("%d. %s" %(index+1, club))
    
def print_national_flag(nations):
    fig = plt.figure(figsize=(10, 10))
    for index, nation in enumerate(nations):
        logo = df[df['Nationality'] == nation]['Flag'].iloc[0]
        logo_image = "img_nation_logo.jpg"
        logo_flag = requests.get(logo).content
        with open(logo_image, 'wb') as handler:
            handler.write(logo_flag)
        img=mpimg.imread(logo_image)
        ax = fig.add_subplot(1, 6, index+1, xticks=[], yticks=[])
        fig.tight_layout()
        ax.imshow(img, interpolation="lanczos")
        ax.set_title("%d. %s" %(index+1, nation))

Best football clubs

  • Here are the top 5 football clubs by average overall rating
In [54]:
d = {'Overall': 'Average_Rating'}
best_overall_club_df = df.groupby('Club').agg({'Overall':'mean'}).rename(columns=d)
clubs = best_overall_club_df.Average_Rating.nlargest(5).index
clubs_list = []

print_club_flag(clubs)

Clubs that have the best Attack

  • Here are the top 5 clubs that specialize in attack
In [55]:
attck_list = ['Shooting', 'Power', 'Passing']

best_attack_df = players.groupby('Club')[attck_list].sum().sum(axis=1)
clubs = best_attack_df.nlargest(5).index

print_club_flag(clubs)
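
Summing skill scores per club, as the cell above does, also rewards clubs that simply list more players. A minimal sketch on made-up data (the frame `toy` and its club names are hypothetical, not taken from the dataset) comparing the summed score with a per-player average:

```python
import pandas as pd

# Made-up stand-in for the players frame; only two skill columns are used.
toy = pd.DataFrame({
    'Club':     ['Big', 'Big', 'Big', 'Small', 'Small'],
    'Shooting': [60, 60, 60, 85, 80],
    'Passing':  [60, 60, 60, 80, 85],
})
skills = ['Shooting', 'Passing']

# Same pattern as the cell above: total over all players and skills.
total = toy.groupby('Club')[skills].sum().sum(axis=1)

# Alternative: average per player first, so squad size does not matter.
per_player = toy.groupby('Club')[skills].mean().sum(axis=1)

print(total)
print(per_player)
```

In this toy example the larger squad wins on the total while the smaller squad wins per player, so the ranking depends on which question is being asked.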

Clubs that have the best Defense

In [56]:
best_defense_df = players.groupby('Club')['Defending'].sum()
clubs = best_defense_df.nlargest(5).index
print_club_flag(clubs)

    

Nation-level Analysis

Best footballing nations

In [57]:
d = {'Overall': 'Average_Rating'}
best_overall_country_df = df.groupby('Nationality').agg({'Overall':'mean'}).rename(columns=d)
nations = best_overall_country_df.Average_Rating.nlargest(5).index
print_national_flag(nations)
In [58]:
best_3_uae = df[df['Nationality'] == 'United Arab Emirates']['Overall'].nlargest(3)
print(best_3_uae)
uae_df = df[df['Nationality'] == 'United Arab Emirates']
uae_df[uae_df['Overall'].isin(best_3_uae)]['Name']
1170    77
Name: Overall, dtype: int64
Out[58]:
1170    O. Abdulrahman
Name: Name, dtype: object
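
The output above lists a single UAE player, so a mean taken over one row is enough to put a nation in the top 5. One possible guard, sketched on made-up data, is to require a minimum number of players before ranking by mean. The cutoff `MIN_PLAYERS` and the frame `toy` are assumptions for illustration only:

```python
import pandas as pd

# Made-up stand-in for df; only the two columns used here.
toy = pd.DataFrame({
    'Nationality': ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'Overall':     [80, 82, 84, 90, 70, 75, 71],
})

MIN_PLAYERS = 3  # hypothetical cutoff, not part of the original analysis

# Mean rating and player count per nation in one pass.
stats = toy.groupby('Nationality')['Overall'].agg(['mean', 'count'])

# Drop nations whose mean rests on too few players.
eligible = stats[stats['count'] >= MIN_PLAYERS]
top_nations = eligible['mean'].nlargest(2).index.tolist()
print(top_nations)
```

Nation `B` has the highest single rating but only one player, so it is excluded before the ranking is taken.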

Nations that have the best Attack

In [59]:
best_attack_nation_df = players.groupby('Nationality')[attck_list].sum().sum(axis=1)
nations = best_attack_nation_df.nlargest(5).index
print_national_flag(nations)

Nations that have the best Defense

In [60]:
best_defense_nation_df = players.groupby('Nationality')['Defending'].sum()
nations = best_defense_nation_df.nlargest(5).index
print_national_flag(nations)
In [61]:
import requests
import random
from math import pi

import matplotlib.image as mpimg
from matplotlib.offsetbox import (OffsetImage,AnnotationBbox)

def details(row, title, image, age, nationality, photo, logo, club):
    
    flag_image = "img_flag.jpg"
    player_image = "img_player.jpg"
    logo_image = "img_club_logo.jpg"
        
    img_flag = requests.get(image).content
    with open(flag_image, 'wb') as handler:
        handler.write(img_flag)
    
    player_img = requests.get(photo).content
    with open(player_image, 'wb') as handler:
        handler.write(player_img)
     
    logo_img = requests.get(logo).content
    with open(logo_image, 'wb') as handler:
        handler.write(logo_img)
        
    r = lambda: random.randint(0,255)
    colorRandom = '#%02X%02X%02X' % (r(),r(),r())
    
    if colorRandom == '#FFFFFF': colorRandom = '#a5d6a7'  # %02X produces uppercase hex digits
    
    basic_color = '#37474f'
    color_annotate = '#01579b'
    
    img = mpimg.imread(flag_image)
    #flg_img = mpimg.imread(logo_image)
    
    plt.figure(figsize=(15,8))
    categories=list(players)[1:]
    coulumnDontUseGraph = ['Flag', 'Age', 'Nationality', 'Photo', 'Logo', 'Club']
    N = len(categories) - len(coulumnDontUseGraph)
    
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]
    
    ax = plt.subplot(111, projection='polar')
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)
    plt.xticks(angles[:-1], categories, color= 'black', size=17)
    ax.set_rlabel_position(0)
    plt.yticks([25,50,75,100], ["25","50","75","100"], color= basic_color, size= 10)
    plt.ylim(0,100)
    
    values = players.loc[row].drop('Name').values.flatten().tolist() 
    valuesDontUseGraph = [image, age, nationality, photo, logo, club]
    values = [e for e in values if e not in (valuesDontUseGraph)]
    values += values[:1]
    
    ax.plot(angles, values, color= basic_color, linewidth=1, linestyle='solid')
    ax.fill(angles, values, color= colorRandom, alpha=0.5)
    axes_coords = [0, 0, 1, 1]
    ax_image = plt.gcf().add_axes(axes_coords,zorder= -1)
    ax_image.imshow(img,alpha=0.5)
    ax_image.axis('off')
    
    ax.annotate('Nationality: ' + nationality.upper(), xy=(10,10), xytext=(103, 138),
                fontsize= 12,
                color = 'white',
                bbox={'facecolor': color_annotate, 'pad': 7})
                      
    ax.annotate('Age: ' + str(age), xy=(10,10), xytext=(43, 180),
                fontsize= 15,
                color = 'white',
                bbox={'facecolor': color_annotate, 'pad': 7})
    
    ax.annotate('Team: ' + club.upper(), xy=(10,10), xytext=(92, 168),
                fontsize= 12,
                color = 'white',
                bbox={'facecolor': color_annotate, 'pad': 7})

    arr_img_player = plt.imread(player_image, format='jpg')

    imagebox_player = OffsetImage(arr_img_player)
    imagebox_player.image.axes = ax
    abPlayer = AnnotationBbox(imagebox_player, (0.5, 0.7),
                        xybox=(313, 223),
                        xycoords='data',
                        boxcoords="offset points"
                        )
    arr_img_logo = plt.imread(logo_image, format='jpg')

    imagebox_logo = OffsetImage(arr_img_logo)
    imagebox_logo.image.axes = ax
    abLogo = AnnotationBbox(imagebox_logo, (0.5, 0.7),
                        xybox=(-320, -226),
                        xycoords='data',
                        boxcoords="offset points"
                        )

    ax.add_artist(abPlayer)
    ax.add_artist(abLogo)

    plt.title(title, size=50, color= basic_color)
In [62]:
# defining a polar graph

def get_id_card(id = 0):
    if 0 <= id < len(df.ID):
        details(row = players.index[id], 
                title = players['Name'][id], 
                age = players['Age'][id], 
                photo = players['Photo'][id],
                nationality = players['Nationality'][id],
                image = players['Flag'][id], 
                logo = players['Club_Logo'][id], 
                club = players['Club'][id])
    else:
        print('The base has %d players. You can put numbers from 0 to %d' % (len(df.ID), len(df.ID) - 1))

Top 5 footballers

  • This gives a pictorial representation of the top 5 footballers
  • Thanks to Roshan Sharma for the ID card code. Really well done!
In [63]:
best_footballers = df['Overall'].nlargest(5)
for index in best_footballers.index:
    get_id_card(index)

Dream Team

  • Ever dreamt of a team which would have all your favourite players?
  • The team below has the player with the highest potential rating in each position :)
In [64]:
df.loc[df.groupby(df['Position'])['Potential'].idxmax()][['Name', 'Position', 'Overall', 'Age', 'Nationality', 'Club']]
Out[64]:
     Name                 Position  Overall  Age  Nationality  Club
31   C. Eriksen           CAM       88       26   Denmark      Tottenham Hotspur
42   S. Umtiti            CB        87       24   France       FC Barcelona
27   Casemiro             CDM       88       26   Brazil       Real Madrid
350  A. Milik             CF        81       24   Poland       Napoli
78   S. Milinković-Savić  CM        85       23   Serbia       Lazio
3    De Gea               GK        91       27   Spain        Manchester United
28   J. Rodríguez         LAM       88       26   Colombia     FC Bayern München
35   Marcelo              LB        88       30   Brazil       Real Madrid
77   M. Škriniar          LCB       85       23   Slovakia     Inter
11   T. Kroos             LCM       90       28   Germany      Real Madrid
14   N. Kanté             LDM       89       27   France       Chelsea
15   P. Dybala            LF        89       24   Argentina    Juventus
415  H. Aouar             LM        80       20   France       Olympique Lyonnais
21   E. Cavani            LS        89       31   Uruguay      Paris Saint-Germain
2    Neymar Jr            LW        92       26   Brazil       Paris Saint-Germain
601  Jonny                LWB       79       24   Spain        Wolverhampton Wanderers
171  H. Ziyech            RAM       83       25   Morocco      Ajax
247  João Cancelo         RB        82       24   Portugal     Juventus
8    Sergio Ramos         RCB       91       32   Spain        Real Madrid
4    K. De Bruyne         RCM       91       27   Belgium      Manchester City
45   P. Pogba             RDM       87       25   France       Manchester United
0    L. Messi             RF        94       31   Argentina    FC Barcelona
25   K. Mbappé            RM        88       19   France       Paris Saint-Germain
7    L. Suárez            RS        91       31   Uruguay      FC Barcelona
79   Marco Asensio        RW        85       22   Spain        Real Madrid
766  Pablo Maffeo         RWB       78       20   Spain        VfB Stuttgart
1    Cristiano Ronaldo    ST        94       33   Portugal     Juventus
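
The selection cell above ranks on `Potential`, while the table displays `Overall`. A small sketch of the same `groupby(...).idxmax()` pattern on made-up data (the frame `toy` and the helper `best_per_position` are hypothetical) shows that the two criteria can pick different players:

```python
import pandas as pd

# Made-up players; two per position with different Overall and Potential.
toy = pd.DataFrame({
    'Name':      ['P1', 'P2', 'P3', 'P4'],
    'Position':  ['GK', 'GK', 'ST', 'ST'],
    'Overall':   [85, 80, 90, 88],
    'Potential': [85, 92, 90, 93],
})

def best_per_position(frame, column):
    # idxmax returns, for each position, the row label of the highest value.
    rows = frame.groupby('Position')[column].idxmax()
    return frame.loc[rows, ['Name', 'Position', column]]

by_potential = best_per_position(toy, 'Potential')
by_overall = best_per_position(toy, 'Overall')
print(by_potential)
print(by_overall)
```

Ranking on `Overall` answers "who is best right now", while `Potential` answers "who could become best", so the choice of column should match the intended meaning of the dream team.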

Wage Analysis

In [65]:
#### sns.set(style = 'dark', palette = 'colorblind', color_codes = True)
x = df.Wage
plt.figure(figsize = (12, 8))
ax = sns.distplot(x, bins = 50, kde = False, color = 'm')
ax.set_xlabel(xlabel = 'Player Wage', fontsize = 16)
ax.set_ylabel(ylabel = 'Player Count',fontsize = 16)
ax.set_title(label = 'Histogram that shows the wage of the Players', fontsize = 20)
plt.show()
In [66]:
df[df['Wage']>300000][['Name','Age','Wage']]
Out[66]:
    Name               Age  Wage
0   L. Messi           31   565000.0
1   Cristiano Ronaldo  33   405000.0
4   K. De Bruyne       27   355000.0
5   E. Hazard          27   340000.0
6   L. Modrić          32   420000.0
7   L. Suárez          31   455000.0
8   Sergio Ramos       32   380000.0
11  T. Kroos           28   355000.0
20  Sergio Busquets    29   315000.0
28  J. Rodríguez       26   315000.0
30  Isco               26   315000.0
32  Coutinho           26   340000.0
36  G. Bale            28   355000.0

Inference

  • The wage distribution is heavily right-skewed
  • Only 13 players earn more than 300,000 Euros
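
The skew read off the histogram can also be put into numbers. A minimal sketch on made-up wages (the values in `wages` are invented for illustration), using the sample skewness and the share of players above the 300,000 cutoff:

```python
import pandas as pd

# Made-up wages: many low earners and one very high earner.
wages = pd.Series([1000, 2000, 2000, 3000, 4000, 5000, 8000, 565000])

skewness = wages.skew()                 # positive means a long right tail
share_above = (wages > 300000).mean()   # fraction of players above the cutoff

print(skewness, share_above)
print(wages.median(), wages.mean())     # mean far above median is another sign of skew
```

On the real data the same two calls would be applied to `df.Wage`.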
In [67]:
df.groupby('Wage')['Overall'].mean().plot()
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a3123f048>

Inference

  • The mean overall rating climbs with wage, so the highest wages go almost exclusively to star performers
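
Grouping by `Wage` produces a jagged line because many wage values occur only a few times. One alternative, sketched here on made-up numbers, is a rank (Spearman) correlation between rating and wage, computed as the Pearson correlation of the ranks. The frame `toy` is an assumption for illustration:

```python
import pandas as pd

# Made-up ratings and wages: wage grows with rating, but far from linearly.
toy = pd.DataFrame({
    'Overall': [60, 65, 70, 75, 80, 85, 90, 94],
    'Wage':    [1000, 2000, 3000, 8000, 20000, 60000, 200000, 565000],
})

# Pearson on the raw values is pulled down by the non-linear shape.
pearson = toy['Overall'].corr(toy['Wage'])

# Spearman: Pearson correlation of the ranks, robust to the extreme top wages.
rho = toy['Overall'].rank().corr(toy['Wage'].rank())

print(pearson, rho)
```

A rank correlation close to 1 would support the reading that better-rated players earn more, without being dominated by a few extreme wages.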

Age Analysis

In [68]:
df.groupby('Age')['Overall'].mean().plot()
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2edb5780>

Inference

  • The average overall rating dips after age 30
  • It rises again after 43, so let us look at why
In [69]:
sns.countplot(x='Age', data=df)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a3048b780>
In [70]:
df[df['Age']>40][['Name','Overall','Age','Nationality']]
Out[70]:
       Name           Overall  Age  Nationality
1120   J. Villar      77       41   Paraguay
4228   B. Nivet       71       41   France
4741   O. Pérez       71       45   Mexico
7225   C. Muñoz       68       41   Argentina
10545  S. Narazaki    65       42   Japan
12192  H. Sulaimani   63       41   Saudi Arabia
15426  M. Tyler       59       41   England
17726  T. Warner      53       44   Trinidad & Tobago
18183  K. Pilkington  48       44   England

Inference

  • Only a handful of players are older than 40; above 43 the average rests on just three players
  • O. Pérez (45, rated 71) is surely an outlier and Mexico's pride!
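
The rise at the oldest ages comes from averaging very few rows. A sketch on made-up data of how the mean and the player count per age can be inspected together. The threshold `MIN_COUNT` and the frame `toy` are hypothetical:

```python
import pandas as pd

# Made-up ages and ratings; age 45 has a single player.
toy = pd.DataFrame({
    'Age':     [25, 25, 25, 30, 30, 45],
    'Overall': [70, 72, 74, 68, 66, 71],
})

# Mean rating and number of players at each age.
by_age = toy.groupby('Age')['Overall'].agg(['mean', 'count'])

MIN_COUNT = 2  # hypothetical threshold for trusting the mean
reliable = by_age[by_age['count'] >= MIN_COUNT]

print(by_age)
print(reliable)
```

Plotting only the `reliable` rows would keep a lone veteran from bending the curve.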
In [87]:
# .copy() avoids a SettingWithCopyWarning when adding a column to the filtered frame
new_wage = df[df['Wage']>10000].copy()
new_wage['age_group'] = pd.cut(new_wage.Age, bins=4)
#new_wage.plot(x='age_group', y='Wage', kind = 'bar')
ax = new_wage.boxplot(column='Wage', by='age_group', showmeans=True)
ax.set_xlabel(xlabel = 'Age Group', fontsize = 20)
ax.set_ylabel(ylabel = 'Wage', fontsize = 20)
Out[87]:
Text(0, 0.5, 'Wage')

Inference

  • Players are in high demand in their mid-20s
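
With `bins=4` the group edges are chosen automatically and fall on fractional ages. A sketch on made-up data with explicit bin edges and labels, comparing the median wage per age group. The edges, labels, and the frame `toy` are assumptions, not values from the original analysis:

```python
import pandas as pd

# Made-up ages and wages.
toy = pd.DataFrame({
    'Age':  [19, 22, 24, 26, 28, 31, 33, 36],
    'Wage': [12000, 20000, 45000, 90000, 80000, 60000, 30000, 15000],
})

# Hypothetical bin edges; each interval is open on the left, closed on the right.
edges = [15, 20, 25, 30, 35, 50]
labels = ['16-20', '21-25', '26-30', '31-35', '36+']
toy = toy.assign(age_group=pd.cut(toy['Age'], bins=edges, labels=labels))

# Median is less sensitive than the mean to a few very high wages.
median_wage = toy.groupby('age_group', observed=False)['Wage'].median()
print(median_wage)
```

Using `assign` on the frame also sidesteps the chained-assignment warning that appears when a column is added to a filtered slice.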

Next Steps

  • This section will involve further analysis and predictions, such as:
    • Who will be the next big star?
    • Which factors contribute to a better salary?
    • Please comment on which predictions or analyses you would like to see
In [72]:
"""positions = ['CAM', 'CB', 'CDM', 'CF', 'CM', 'LAM',
       'LB', 'LCB', 'LCM', 'LDM', 'LF', 'LM', 'LS', 'LW', 'LWB', 'RAM', 'RB', 'RCB', 'RCM', 'RDM', 'RF',
       'RM', 'RS', 'RW', 'RWB']"""
In [73]:
"""for i in positions:
    print('\n\n','Top 10', i, 'in FIFA 19', '\n')
    temp_df = df[df.Position == i]
    print(temp_df.sort_values(i, ascending=False).head(10).reset_index()[['Name', i]])

    
#print(df.sort_values(temp_df, ascending=False).head(10).reset_index()[['Name', 'Nationality', 'Club', 'Overall']])"""